Our estimate for \(\mu\) is unbiased but has a fairly large variance.
I have preliminary results showing that our method of averaging results over all SNPs works when the true effect at the CV (\(\mu\)) is known and used in \(Z_j\).
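For reference, a minimal Python sketch of how a known \(\mu\) enters the expected Z scores, assuming the standard joint Z-score model in which the expected Z score at SNP \(j\) is \(\mu\) times its correlation with the CV (the correlations and \(\mu\) value below are made up for illustration):

```python
import numpy as np

# hypothetical correlations (LD) of each SNP with the causal variant
r_with_cv = np.array([1.0, 0.9, 0.5, 0.1])

mu = 5.0  # the known true effect at the CV, on the Z-score scale

# expected Z score at each SNP under the joint (multivariate normal) Z model
expected_z = mu * r_with_cv  # array([5. , 4.5, 2.5, 0.5])
```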
I check these results using UK10K data, and in low, medium and high LD regions using 1000 Genomes data.
These results are running in the 24july/knownmu/ directory.
Number of simulations making up each bin:
##
## (0,1] (1,2] (2,2.5] (2.5,3] (3,4] (4,4.75] (4.75,6] (6,8]
## 262 453 264 310 665 436 665 666
## (8,13] (13,16] (16,Inf]
## 680 254 270
Now for the high/medium/low LD regions using 1000 Genomes data:
##
## (0,1] (1,2] (2,2.5] (2.5,3] (3,4] (4,4.75] (4.75,6] (6,8] (8,13]
## low 35 299 126 92 215 168 200 158 223
## high 23 265 171 68 205 153 187 176 259
## medium 28 257 181 94 189 189 189 169 250
##
## (13,16] (16,Inf]
## low 44 94
## high 33 104
## medium 32 115
Problem: Our correction method is not as accurate in regions of high LD.
Ideas:
As LD increases, so does the difficulty in separating out the signals.
Higher values of \(\mu\) help to separate out the true effect, as the separability between the CV and the other SNPs is increased.
Factors affecting separability: LD, power (sample size and MAFs).
## mu_obs muhat true claimed corrected
## 1 18.66619 18.65872 0.9812 0.9574063 0.9224276
Let’s now consider the effect of LD and effect size on the separability (i.e. how well the true signal can be picked out).
Possible measures of LD:

- maxr2
- meanr2
- medianr2
- r20.2 or r20.1 etc.

Possible measures of separability:

- nvar_cs
- PP_iCV

To visualise this, I plot \(\mu\) against the LD for many simulations, with contours added reflecting the separability.
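The candidate LD measures above could be computed roughly as follows (a Python sketch on made-up \(r^2\) values; interpreting the r20.2-style measures as counts of SNPs with \(r^2\) above a threshold is an assumption):

```python
import numpy as np

# hypothetical r^2 values between the CV and the other SNPs in the region
r2 = np.array([0.95, 0.60, 0.30, 0.10, 0.05])

max_r2 = r2.max()                 # maxr2
mean_r2 = r2.mean()               # meanr2
median_r2 = np.median(r2)         # medianr2
n_r2_02 = int((r2 > 0.2).sum())   # r20.2-style: SNPs with r^2 above 0.2
```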
I then take vertical slices of the plots on the right hand side (for low, medium and high LD) to investigate the relationship between the value of \(\mu\) and the separability of the signals. We are looking to see if there is a way we can model this relationship.
It is perhaps better if we look at the logit of the separability.
For this I focus on the “max \(r^2\) with the CV” LD measure.
Method:
- Add a logit(PP_iCV) column to the original data (removing any Inf rows)
- Fit loess(logit_PP_iCV ~ maxr2 * true.mu, data = data)
- Examine the fit over maxr2 and true.mu
- Take a slice (0.3<maxr2<0.35)

In high LD it is harder to separate out the signals, and a larger effect size at the CV may help with this. This is why in our proportion covered plots, lots of these proportions are lower than the threshold (the PP doesn’t get allocated to the CV correctly - it may be spread amongst other variants in high LD).
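A rough Python stand-in for the loess fit in the method above (loess is an R smoother with no direct NumPy analogue, so an ordinary least-squares fit with an interaction term is used instead; all data below are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# toy simulated data: LD with the CV, true effect size, and separability
maxr2 = rng.uniform(0, 1, n)
true_mu = rng.uniform(2, 10, n)
lin = -1 + 0.6 * true_mu - 3 * maxr2 + rng.normal(0, 0.5, n)
pp_icv = 1 / (1 + np.exp(-lin))  # separability, bounded in (0, 1)

# logit transform, dropping any rows that would be infinite
ok = (pp_icv > 0) & (pp_icv < 1)
y = np.log(pp_icv[ok] / (1 - pp_icv[ok]))

# design matrix mimicking logit_PP_iCV ~ maxr2 * true.mu
X = np.column_stack([np.ones(ok.sum()), maxr2[ok], true_mu[ok],
                     maxr2[ok] * true_mu[ok]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# expect a negative coefficient on maxr2 and a positive one on true_mu
```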
To investigate this, let’s plot the proportion of simulated credible sets (where the CV is assumed causal) that contain the CV (propcov_cv) against our 3 different measures of LD discussed above.
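Computing propcov_cv itself is straightforward; a small Python sketch (the credible sets below are made up):

```python
import numpy as np

def prop_cov(credible_sets, cv):
    """Proportion of simulated credible sets that contain the CV."""
    return float(np.mean([cv in cs for cs in credible_sets]))

# four simulated credible sets (sets of SNP indices); the CV is SNP 4
sets = [{1, 4, 7}, {2, 4}, {4}, {3, 5}]
prop_cov(sets, 4)  # 0.75: three of the four sets contain the CV
```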
Indeed, my results show that there is a negative relationship between the amount of LD and the empirical coverage proportions where the true CV is assumed causal, and thus our corrected coverage estimate is consistently too low in high LD regions.
Explanation: In our correction, we are averaging something that has an upper bound of 1 (the empirical coverage proportions), where lots of the data points lie close to the upper bound. This means that it is “easier” for the empirical coverage proportions to be too low than too high, and lower estimates are more detrimental than higher estimates. This may explain why our corrected coverage estimate is consistently too low in high powered scenarios.
Potential fix would be to take the logit of the prop_cov quantities used in the final corrected coverage estimate weighting, so that they are no longer bounded.
Our final corrected coverage estimate is then given by:
logit <- function(x) log(x / (1 - x))
invlogit <- function(x) exp(x) / (1 + exp(x))

# weighted average on the logit scale, then transform back to (0, 1)
corrcov_tmp <- sum(pp0 * logit(prop_cov))
corrcov <- invlogit(corrcov_tmp)
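The same calculation in Python, on made-up pp0 weights and coverage proportions, illustrating that averaging on the logit scale pulls the estimate up relative to the plain weighted average when the proportions sit near the upper bound of 1:

```python
import numpy as np

def logit(x):
    return np.log(x / (1 - x))

def invlogit(x):
    return np.exp(x) / (1 + np.exp(x))

# hypothetical weights (summing to 1) and empirical coverage proportions
pp0 = np.array([0.2, 0.3, 0.5])
prop_cov = np.array([0.90, 0.95, 0.99])

corrcov_plain = np.sum(pp0 * prop_cov)                    # 0.96
corrcov_logit = invlogit(np.sum(pp0 * logit(prop_cov)))   # ~0.974
```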
I use UK10K haplotypes (7562 of these) for Z scores and MAFs, and 1000 Genomes haplotypes (1005 of these) for LD.
For each simulation, I calculate the corrected coverage using the “true uk10k” haplotypes and also using 1000 Genomes reference haplotypes.
Method:

- Use simGWAS to generate the system to be corrected

The number of simulations in each bin:
##
## (0,1] (1,2] (2,2.5] (2.5,3] (3,4] (4,4.75] (4.75,6] (6,8]
## 237 390 259 277 565 461 615 575
## (8,13] (13,16] (16,Inf]
## 665 238 258
Read “BAYESIAN MODEL SEARCH AND MULTILEVEL INFERENCE FOR SNP ASSOCIATION STUDIES” paper.
In the paper, do I need to write down the maths of the Maller method? I.e. page 9 of my report.
Surely an MAF cannot be greater than 0.5, so why in our simulations do we select a CV with 0.05<MAF<0.95?
\(\mu\) is the Z score at the CV; Paul said that this is not equivalent to the true effect. Specifically, the Z score equals the true effect only if both the outcome and the predictor are normalised?
Paul and Jenn thought it was confusing to introduce PP’ (including the null) to derive \(\hat\mu\) but then go back to normal PPs for the correction - can we really not just use max(abs(Z)) for \(\mu\)?